1.  INTRODUCTION

Several attempts have been made recently to use linear prediction analysis of speech for isolated word(ref.1) and
spoken digit recognition(ref.3,4). The feature chosen for the recognition algorithm in these studies was the set of
linear prediction coefficients. It is well known that an estimate of the vocal tract area function can be derived from
these coefficients(ref.5,6,7), and the present paper arose from a study of the usefulness of this function for both speech
and voice recognition. Because of the extensive information available from phonetic and articulatory studies of speech
production, it was believed that the vocal tract area function (VTAF) would be an advantageous feature for the pattern
recognition process. To test this idea, and also to compare the various formulations of the linear prediction models,
it was decided to display the VTAF as an intensity-modulated picture of vocal tract position versus time, with the area
plotted as a grey level. This is of course a similar display to the well known spectrogram. We shall refer to these
displays as VTAF pictures.

Figure 1 shows a typical VTAF picture produced from real speech. The most obvious feature of this picture is the
strong pulsations seen, for example, at the intervals labelled 2, 5 and 8. It is believed these pulsations are artifacts of
the analysis, as no evidence of them is apparent in the time series or spectrogram.

Before discussing this phenomenon in more detail we describe briefly the production of the pictures.

2.  EXPERIMENTAL  RESULTS 

2.1  The  VTAF  picture 

The first linear prediction model used in this study was that due to Wakita(ref.7). This is a so-called
autocorrelation technique and was chosen because, for non pitch-synchronous analysis, these formulations are
generally more stable and robust than the "covariance" methods, although for pitch-synchronous analysis the
latter are capable of giving better estimates of the actual vocal tract(ref.8,9).

Suppose that the anti-aliasing filtered speech signal is sampled at frequency $f_s = 1/T$, that $n_w$ samples are
included in each autocorrelation window, and that a new computation of the VTAF is made every $n_c$ samples.
If $m_l$ linear prediction coefficients are used then $m_a = m_l + 1$ vocal tract areas are produced at time intervals of
$t_c$, where

$$t_c = n_c T \qquad (1)$$

Denoting the array of vocal tract areas $a_i(t)$ obtained at time $t$ as a vector $\mathbf{a}(t)$ we have

$$\mathbf{a}(t) = [a_1(t), a_2(t), a_3(t), \ldots] \qquad (2)$$


If $n$ successive estimates of $\mathbf{a}(t)$ are evaluated then the resulting set of these $\mathbf{a}(t)$ may be regarded as an
($m_a \times n$) matrix $\mathbf{A}$.

This matrix can be plotted as an ($m_a \times n$) digital picture where the grey levels are assigned by some mapping
from the values of the elements of $\mathbf{A}$ to the set of grey levels.

To produce figure 1 the values $f_s = 8192$ Hz, $m_a = 9$, $n_w = 64$, $n_c = 64$ and $n = 1024$ were used. Now a
9 x 1024 picture is a very cumbersome shape and so this was split into eight 9 x 128 subpictures, which for display
purposes were interpolated (by a two-dimensional Fast Fourier Transform) into eight 36 x 512 subpictures.
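The interpolation step can be illustrated with a minimal sketch, assuming band-limited interpolation by zero-padding the discrete Fourier transform along each axis in turn. The paper does not give its implementation; the function names below are hypothetical, and a naive O(N^2) transform stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive, unnormalised discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def interpolate_1d(x, factor):
    """Interpolate x by an integer factor by zero-padding its spectrum."""
    n, m = len(x), len(x) * factor
    spectrum = dft(x)
    padded = [0j] * m
    half = n // 2
    if n % 2:                                   # odd length: no Nyquist bin
        for k in range(half + 1):
            padded[k] += spectrum[k]
        for k in range(1, half + 1):
            padded[m - k] += spectrum[n - k]
    else:                                       # even length: split the Nyquist bin
        for k in range(half):
            padded[k] += spectrum[k]
        for k in range(1, half):
            padded[m - k] += spectrum[n - k]
        padded[half] += spectrum[half] / 2
        padded[m - half] += spectrum[half] / 2
    # Dividing by the original length keeps the original sample values.
    return [v.real / n for v in dft(padded, inverse=True)]

def interpolate_2d(picture, row_factor, col_factor):
    """Interpolate a picture (list of rows) along time, then along tract position."""
    wide = [interpolate_1d(row, col_factor) for row in picture]
    columns = [interpolate_1d(list(col), row_factor) for col in zip(*wide)]
    return [list(row) for row in zip(*columns)]

# Small stand-in for a 9 x 128 subpicture; a factor of 4 on each axis
# corresponds to the 9 x 128 -> 36 x 512 enlargement described above.
small = [[1.0, 2.0, 4.0, 3.0],
         [2.0, 5.0, 3.0, 1.0],
         [0.5, 1.5, 2.5, 2.0]]
large = interpolate_2d(small, 4, 4)
```

Interpolation of this kind treats each axis as periodic, so values near the glottis and lips edges are influenced by the opposite edge.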

These eight subpictures were plotted, one below the other, as in figure 1 on an intensity-modulated CRT. The
bottom of each subpicture represents the glottis and the top the lips. The grey levels have been assigned such
that the larger the area the greater the whiteness. Thus the point of maximum constriction is the darkest region
in each column. It should be noted that the Wakita model assumes a constant glottis area and thus the lower
edge of each picture is a constant grey level. Some regions of the picture are blank. This is due to use of an
energy-detecting algorithm which assigns arbitrary zero levels to the VTAFs when the total signal occurring in
the time series window is below a given threshold (as during silences between utterances). Each subpicture
represents 1 s of real time and thus 8 s is shown overall.
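The timing figures quoted above follow directly from equation (1). The short check below uses the parameter values given for figure 1; the variable names are chosen here for illustration only.

```python
fs = 8192            # sampling frequency in Hz
n_c = 64             # samples between successive VTAF computations
n = 1024             # number of VTAF estimates, i.e. picture columns
n_subpictures = 8

T = 1.0 / fs                                        # sampling interval in seconds
t_c = n_c * T                                       # equation (1): time between estimates
total_duration = n * t_c                            # seconds covered by the whole picture
columns_per_subpicture = n // n_subpictures
subpicture_duration = columns_per_subpicture * t_c  # seconds per subpicture
```

With these values each column spans 7.8125 ms, each 128-column subpicture spans 1 s, and the whole picture spans 8 s.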

2.2  Observations 

In figure 1 the occurrence of periods of pulsations is easily observed; the most obvious of these are indicated in
the picture. The utterance shown in this picture is the phrase "Speak to me now, bad kangaroo!" repeated three
times by an Australian female speaker. The observed pulsations occur during constant vowel segments where
little or no real change in the vocal tract is occurring.

This is supported by an examination of the speech time series corresponding to the utterance. Figure 2(a) shows
the speech waveform corresponding to the /ae/ in 'bad', which for this speaker is remarkably stationary. The
corresponding VTAFs are plotted in figure 2(b) and are clearly fluctuating, an effect which does not augur well
for using the VTAFs in any automatic speech recognition process. The plots in figure 2(b) are in fact the square
root of the VTAF, and thus the estimates of area actually vary by the order of 9:1 during this apparently stationary
segment.

We had not observed this phenomenon previously, even though several VTAF pictures of Australian male speakers
repeating the same phrase had been made. This suggested that the phenomenon may be sensitive to pitch period.

Now  for  linear  prediction  models  of  the  Wakita  type  the  analysis  begins  by  windowing  the  time  series  by  a 
Hanning  weighting  function.  The  only  parameter  which  is  changing  during  a stationary  segment  is  thus  the 
position  of  the  glottal  pulse  within  the  Hanning  window  (unless  the  analysis  is  pitch-synchronous).  We  shall  now 
examine  this  effect  and  show  it  can  cause  the  observed  phenomenon. 


3.  ANALYSIS  OF  WINDOW  POSITION  EFFECTS

We  examine  the  effect  of  the  time  relationship  between  the  autocorrelation  window  and  the  speech  waveform  (figure  3). 
To  facilitate  analysis  we  use  a model  comprising  a second  order  (two  junction,  three  section)  vocal  tract  yielding  an 
impulse  response  of  the  form 


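$$h(n) = r^n \cos n\omega T, \quad n = 0, 1, 2, \ldots \qquad (4)$$

We use a window of the form

$$w(n) = \tfrac{1}{2} + \tfrac{1}{2} \cos\frac{2\pi (n - d)}{N}, \quad d - \tfrac{N}{2} \le n \le d + \tfrac{N}{2} \qquad (5)$$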


where $d$ is the delay whose effect is of interest, and $N$, the duration of the window, is large compared with $1/(1-r)$, i.e. the
interval over which the impulse response has significant magnitude. The excitation is taken to be a unit pulse, and thus
the model speech signal $s(n)$ is the same as $h(n)$.

This model is not realistic, but it does aid our appreciation of effects which can arise, and yields a sufficient
explanation of our observations.

Figure 3 shows the effects on the windowed signal $s_w(n)$ of variation of the delay. Two cases are shown, viz.

(a) Centred window, in which $w(n) \approx 1$ over the effective duration of $s(n)$, i.e. we have $s_w(n) \approx s(n)$

(b) Delayed window, in which the curved rise of $w(n)$ progressively magnifies the signal, producing
a compensation of the damping of $s(n)$.
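The two cases can be compared numerically with a minimal sketch. It measures the geometric-mean per-sample ratio of the envelope $r^n w(n)$ over the first few samples, which is one simple way (chosen here, not taken from the paper) of expressing the apparent damping. The window duration N = 128 and the span of 10 samples are assumptions made for illustration; the delay of 61 samples and the damping of 0.9 are values used elsewhere in the text.

```python
import math

def hanning(n, d, N):
    """Raised-cosine window of duration N centred on sample d (zero outside)."""
    if abs(n - d) > N / 2:
        return 0.0
    return 0.5 + 0.5 * math.cos(2 * math.pi * (n - d) / N)

def apparent_damping(r, d, N, span=10):
    """Geometric-mean per-sample ratio of the envelope r**n * w(n) over n = 0..span."""
    first = hanning(0, d, N)                  # envelope at n = 0 (r**0 = 1)
    last = (r ** span) * hanning(span, d, N)  # envelope at n = span
    return (last / first) ** (1.0 / span)

r_true = 0.9          # true damping ratio of the model signal
N_ASSUMED = 128       # window duration in samples (assumption)

centred = apparent_damping(r_true, d=0, N=N_ASSUMED)    # window peak at the pulse
delayed = apparent_damping(r_true, d=61, N=N_ASSUMED)   # pulse on the rising edge
```

Under these assumptions the centred window leaves the apparent damping close to the true value, whereas the delayed window raises it above unity, so the windowed signal grows rather than decays over the span considered.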

Figure 4 shows for example the shape of the window

$$w_d(n) = 0.5\left[1 + \cos\frac{2\pi}{N}(n - 61)\right] \qquad (6)$$

compared with the exponential $R r_1^n$ with $R = 0.014$ and $r_1 = 1.26$, which were chosen in an ad hoc manner simply
for demonstration purposes. The exponential appears to be a reasonable approximation to $w_d(n)$.

We see that for $s(n)$ of the form

$$s(n) = r_2^n \cos n\omega T \qquad (7)$$

with $r_2 < 1$ the delayed window would cause $s_w(n)$ to be approximately

$$s_w(n) = s(n) w_d(n) = R (r_1 r_2)^n \cos n\omega T \qquad (8)$$


Now, for speech sampled at 10 000 s$^{-1}$, the value of $r_2$ is likely to be in the range 0.985 to 0.9 (i.e. approximately
50 Hz to 300 Hz formant bandwidth respectively). Thus, the apparent value of the ratio corresponding to $r_2$ in $s(n)$
as given by (7) becomes the value $r_1 r_2$ in (8), and the latter may be grossly in error, and even exceed unity as with the
example values of $r_1 = 1.26$ and $r_2 = 0.9$, for which $r_1 r_2 = 1.134$.
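The correspondence between damping ratio and formant bandwidth quoted above can be checked with a short sketch. It assumes the usual relation $r = e^{-\pi B T}$ between pole radius and 3 dB bandwidth $B$; the paper does not state the formula it used, so this relation and the function name are assumptions.

```python
import math

def pole_radius(bandwidth_hz, fs_hz):
    """Per-sample damping ratio of a resonance with the given bandwidth."""
    return math.exp(-math.pi * bandwidth_hz / fs_hz)

fs = 10000.0
r_narrow = pole_radius(50.0, fs)     # about 0.984
r_wide = pole_radius(300.0, fs)      # about 0.910

# Apparent damping under the delayed window, using the example values in the text.
r1, r2 = 1.26, 0.9
r_apparent = r1 * r2                 # exceeds unity: a growing rather than decaying envelope
```

The computed radii agree with the quoted range of 0.985 to 0.9 to within the rounding used in the text.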

Next we study the effect of a discrepancy in the value of $r_2$ on the area function of a model vocal tract.

Figure 5 shows an acoustic tube (or transmission line) model of a vocal tract in which there are three sections of area
$a_m$, with $m = 0$ at the lips end, 1 for the middle section and 2 at the glottal end. There are thus two junctions whose volume
velocity reflection coefficients $\mu_m$, $m = 1$ and 2, are given by (ref.10)

$$\mu_m = \frac{a_{m-1} - a_m}{a_{m-1} + a_m} \qquad (9)$$


The termination at the lips is assumed to be equivalent to a tube section of infinite area, resulting in a volume velocity
reflection coefficient of -1. For convenience in analysis we associate all the delay (i.e. sum of delays for forward and
backward travelling waves) with the backward travelling wave in each section. The physical model of figure 5(a) may
then be represented by the signal flow model of figure 5(b).

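Analysis of this model shows that the transfer function $H(z) = U_L(z)/U_G(z)$ is given by

$$H(z) = \frac{(1 + \mu_1)(1 + \mu_2)}{1 + z^{-1} \mu_1 (1 + \mu_2) + z^{-2} \mu_2} \qquad (10)$$

where, from (9), the numerator is

$$(1 + \mu_1)(1 + \mu_2) = \frac{4 a_0 a_1}{(a_0 + a_1)(a_1 + a_2)} \qquad (11)$$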
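The relation between the tube areas and the transfer function can be sketched numerically. The example assumes the reflection coefficient definition of (9), with index 0 at the lips, and builds the denominator by the standard lattice step-up recursion; the areas used are arbitrary illustrative values and the function names are hypothetical.

```python
def reflection_coefficients(areas):
    """areas = [a0 (lips), a1, ..., aM (glottis)] -> [mu_1, ..., mu_M] as in (9)."""
    return [(areas[m - 1] - areas[m]) / (areas[m - 1] + areas[m])
            for m in range(1, len(areas))]

def denominator(mus):
    """Step-up recursion D_m(z) = D_{m-1}(z) + mu_m * z^-m * D_{m-1}(1/z).

    Returns the coefficients of 1, z^-1, z^-2, ...
    """
    poly = [1.0]
    for mu in mus:
        extended = poly + [0.0]
        reversed_shifted = [0.0] + poly[::-1]
        poly = [p + mu * q for p, q in zip(extended, reversed_shifted)]
    return poly

areas = [4.0, 2.0, 1.0]                    # a0, a1, a2 (illustrative values only)
mu1, mu2 = reflection_coefficients(areas)
gain = (1 + mu1) * (1 + mu2)               # numerator of H(z)
gain_from_areas = 4 * areas[0] * areas[1] / ((areas[0] + areas[1]) * (areas[1] + areas[2]))
den = denominator([mu1, mu2])              # [1, mu1 * (1 + mu2), mu2]
```

For two junctions the recursion reproduces the second-order denominator $1 + z^{-1}\mu_1(1+\mu_2) + z^{-2}\mu_2$, and the two expressions for the numerator agree.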
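The impulse response of this system is of the form

$$h(n) = h(0)\, r^n \cos n\omega T$$

where

$$h(0) = (1 + \mu_1)(1 + \mu_2) \qquad (13)$$

$$r^2 = \mu_2 = \frac{a_1 - a_2}{a_1 + a_2} \qquad (14)$$

$$\cos\omega T = -\frac{\mu_1 (1 + \mu_2)}{2r} = \frac{1}{2r} \cdot \frac{2 a_1}{a_1 + a_2} \cdot \frac{a_1 - a_0}{a_0 + a_1} \qquad (15)$$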


From (14) we see that, for this two junction model, the damping ratio $r$ depends only on the reflection coefficient of
the junction closest to the glottis, i.e. on the area ratio at this junction. We might query the physical meaning of the
possibility that $a_1 - a_2 < 0$, i.e. $r^2 < 0$ in (14). Detailed analysis shows that the impulse response is then not oscillatory,
corresponds to real poles, and is not of interest in the present study.
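A small sketch shows how the pole radius and angle follow from the three areas. It assumes the second-order denominator $1 + z^{-1}\mu_1(1+\mu_2) + z^{-2}\mu_2$ with $\mu_m = (a_{m-1} - a_m)/(a_{m-1} + a_m)$, so that $r^2 = \mu_2$ and $\cos\omega T = -\mu_1(1+\mu_2)/(2r)$. The function name and the area values are illustrative only.

```python
import math

def pole_from_areas(a0, a1, a2):
    """Return (r, wT) for the complex pole pair, or None if the poles are real."""
    mu1 = (a0 - a1) / (a0 + a1)
    mu2 = (a1 - a2) / (a1 + a2)
    if mu2 <= 0:                      # a1 - a2 <= 0: no oscillatory response
        return None
    r = math.sqrt(mu2)                # damping ratio depends only on a1 and a2
    cos_wt = -mu1 * (1 + mu2) / (2 * r)
    if abs(cos_wt) > 1:               # discriminant non-negative: real poles
        return None
    return r, math.acos(cos_wt)

oscillatory = pole_from_areas(1.0, 2.0, 1.0)   # a1 > a2: complex pole pair
real_poles = pole_from_areas(1.0, 1.0, 2.0)    # a1 < a2: r^2 would be negative
```

Changing $a_0$ in the first call alters only the pole angle, while the radius stays at $\sqrt{\mu_2}$, in line with the observation that the damping depends only on the junction closest to the glottis.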

To apply this two junction model to realistic speech parameters we set the length of each section equal to half the
length of the vocal tract, i.e. about 9 cm. To tie in with the previous discussion of the 10 000 Hz sampling rate, it is
convenient to let each of the sections be equivalent to an each-way delay of $3 \times 10^{-4}$ s. The delay $T$ in the second
order model described by equation (8) is thus $6 \times 10^{-4}$ s, and the relevant values of $r$ for use in these equations are
$(r_1 r_2)^6$ or $(r_2)^6$. Of course this change in fact replaces the second order system by a 12th order system if the original
sampling rate is maintained, since denominator factors of $H(z)$ in (12) of the form

$$(1 - z^{-1} r e^{j\omega T}) \qquad (16)$$

are replaced by factors of the form

$$(1 - z^{-6} r^6 e^{j6\omega T}).$$

Each of these factors results in 6 poles, but the base pole of each is the same as previously, i.e. at $z = r e^{j\omega T}$.
From (14) we find

$$a_1 = a_2 \, \frac{1 + r^2}{1 - r^2} \qquad (17)$$

$$a_2 = a_1 \, \frac{1 - r^2}{1 + r^2} \qquad (18)$$
From (17) and also (18) we see that the proportional variation of either area $a_1$ or $a_2$ with $r$, while the other is fixed,
becomes very great as $r \to 1$. We found earlier that the effect of the delayed window on the apparent damping was
sufficient to make $r$ pass through unity, and thus the system may incur such great sensitivities. For example, varying $r$
from 0.9 to 0.99 causes $a_1/a_2$ to change from 9.53 to 99.5. Clearly this effect is sufficient to account for variations as
large as those observed in Section 2.2.
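The quoted figures can be reproduced from (14), which gives $a_1/a_2 = (1 + r^2)/(1 - r^2)$. The logarithmic sensitivity $4r/(1 - r^4)$ used below is obtained by differentiating the logarithm of that ratio; it is added here as an illustration and does not appear in the original analysis.

```python
def area_ratio(r):
    """a1 / a2 as a function of the damping ratio r, from r^2 = (a1 - a2)/(a1 + a2)."""
    return (1 + r * r) / (1 - r * r)

def log_sensitivity(r):
    """d ln(a1/a2) / dr = 4r / (1 - r^4): fractional change in the ratio per unit change in r."""
    return 4 * r / (1 - r ** 4)

ratio_at_090 = area_ratio(0.90)    # about 9.53
ratio_at_099 = area_ratio(0.99)    # about 99.5
```

The sensitivity rises from roughly 10 at $r = 0.9$ to roughly 100 at $r = 0.99$, so a small error in the apparent damping translates into a very large change in the estimated area ratio.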

For completeness, we study the effect at the first junction. We note that $\omega T$ is not affected by the window delay
phenomenon, and thus we set $d(\cos\omega T) = 0$ when differentiating (15).

We find

$$\mu_1 = \frac{a_0 - a_1}{a_0 + a_1} = -\frac{2r}{1 + r^2} \cos\omega T \qquad (19)$$

and

$$\frac{d\mu_1 / \mu_1}{dr} = \frac{1 - r^2}{r (1 + r^2)} \qquad (20)$$

which shows that the area ratios at the first junction are not strongly influenced by r.
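The contrast between the two junctions can be illustrated numerically. The sketch assumes the second-order denominator $1 + z^{-1}\mu_1(1+\mu_2) + z^{-2}\mu_2$ with $\mu_2 = r^2$, from which $\mu_1 = -2r\cos\omega T/(1 + r^2)$ and $a_0/a_1 = (1 + \mu_1)/(1 - \mu_1)$. The value $\cos\omega T = 0.5$ is an arbitrary illustrative choice, and the derivatives are taken by central differences.

```python
import math

def lips_area_ratio(r, cos_wt):
    """a0 / a1 at the first junction for fixed cos(wT)."""
    mu1 = -2.0 * r * cos_wt / (1.0 + r * r)
    return (1.0 + mu1) / (1.0 - mu1)

def glottal_area_ratio(r):
    """a1 / a2 at the junction closest to the glottis."""
    return (1.0 + r * r) / (1.0 - r * r)

def log_sensitivity(f, r, step=1e-6):
    """Central-difference estimate of d ln f / dr."""
    return (math.log(f(r + step)) - math.log(f(r - step))) / (2 * step)

first_junction = log_sensitivity(lambda r: lips_area_ratio(r, 0.5), 0.9)
second_junction = log_sensitivity(glottal_area_ratio, 0.9)
```

Under these assumptions the first-junction sensitivity at $r = 0.9$ is well below unity in magnitude, whereas the second-junction sensitivity is about 10.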

Note also that there is a gross effect on the initial value $h(0)$ of the windowed response. Via (13) we see that this can
affect the product of the junction transmission coefficients, i.e. $(1 + \mu_1)(1 + \mu_2)$. The effect on a particular feature
however is not explicit.

For more complex vocal tract models, the effects are more complex, but we have demonstrated a sufficient mechanism
to account for the observations. One previous study(ref.11) showed that under moderate variation of the pole dampings
in a 5 pole signal, the resultant VTAF retained its gross features, but underwent a gradual smooth change.


4.  EXPERIMENTAL  DIAGNOSIS

To test these ideas, synthetic vowels were generated (in the computer) using an all-pole filter and known excitation
function. Details of the synthesis algorithm used are given in Rogers(ref.12). The four poles used were derived from
the values of formant positions and bandwidths given by Fant(ref.13) (Table 1).


TABLE 1. POLE POSITIONS IN TERMS OF FORMANT FREQUENCIES
AND BANDWIDTHS IN Hz (AFTER FANT(REF.13))

         First formant     Second formant    Third formant     Fourth formant
Vowel    freq.   b-width   freq.   b-width   freq.   b-width   freq.   b-width
/ae/     616     57        1072    72        2430    130       3410    175
/e/      432     39        1959    95        2722    170       3500    325
/i/      222     60        2244    75        3140    240       3700    230
/o/      510     54        900     65        2400    100       3220    135
/u/      231     69        615     50        2375    110       3320    115
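The conversion from formant data to filter poles can be sketched as follows. The sketch assumes the usual mapping of a formant with frequency $F$ and bandwidth $B$ to a pole at radius $e^{-\pi B/f_s}$ and angle $2\pi F/f_s$, and a sampling rate of 8192 Hz as used for figure 1. The synthesis algorithm of ref.12 is not reproduced here, and the function names are hypothetical.

```python
import cmath
import math

FS = 8192.0                                                     # assumed sampling rate in Hz
TABLE_1_AE = [(616, 57), (1072, 72), (2430, 130), (3410, 175)]  # /ae/ row of Table 1

def formant_pole(freq_hz, bandwidth_hz, fs=FS):
    """Upper-half-plane pole for one formant."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq_hz / fs)

def all_pole_denominator(formants, fs=FS):
    """Multiply the second-order sections 1 - 2 Re(p) z^-1 + |p|^2 z^-2 together."""
    poly = [1.0]
    for freq_hz, bandwidth_hz in formants:
        pole = formant_pole(freq_hz, bandwidth_hz, fs)
        section = [1.0, -2.0 * pole.real, abs(pole) ** 2]
        product = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):
            for j, s in enumerate(section):
                product[i + j] += p * s
        poly = product
    return poly

poles = [formant_pole(f, b) for f, b in TABLE_1_AE]
denominator = all_pole_denominator(TABLE_1_AE)   # 9 coefficients of an 8th-order filter
```

Four formants give an 8th-order all-pole filter, which matches an analysis with 8 predictor coefficients and hence 9 vocal tract areas.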
Figure 6 shows plots of the VTAFs obtained for the synthetic vowel /ae/ when the pitch period $n_p$ has been made
equal to the computation interval $n_c$. In each of the four columns however the 'phase' of the excitation function
relative to the computation window is different (as indicated in the figure). Clearly the areas calculated vary with this
phase. This means that when $n_p \neq n_c$ the area calculated from a constant waveform will fluctuate as the position of
the excitation impulse changes within the window. Figure 7 shows this happening when $n_p \approx 0.8\, n_c$ for five different
synthetic vowels.
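The experiment can be imitated with a minimal sketch, under several assumptions that are not taken from the paper: the vowel is synthesised by exciting an all-pole filter (built from the /ae/ row of Table 1) with a unit impulse train, the analysis is a Hanning-windowed autocorrelation method solved by the Levinson-Durbin recursion, and the resulting reflection coefficients are identified with the junction coefficients $\mu_m$ of (9), with the glottal area fixed at 1. All names are hypothetical.

```python
import math

FS = 8192                      # sampling rate in Hz
N_P = 64                       # pitch period in samples, equal to the computation interval
ORDER = 8                      # 8 predictor coefficients give 9 areas
FORMANTS_AE = [(616, 57), (1072, 72), (2430, 130), (3410, 175)]

def all_pole_denominator(formants, fs):
    poly = [1.0]
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / fs)
        section = [1.0, -2 * r * math.cos(2 * math.pi * freq / fs), r * r]
        out = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):
            for j, s in enumerate(section):
                out[i + j] += p * s
        poly = out
    return poly

def synthesize(n_samples, pitch_period, den):
    """All-pole filter driven by a unit impulse every pitch_period samples."""
    y = []
    for n in range(n_samples):
        acc = 1.0 if n % pitch_period == 0 else 0.0
        for i in range(1, min(n, len(den) - 1) + 1):
            acc -= den[i] * y[n - i]
        y.append(acc)
    return y

def reflection_coefficients(frame, order):
    """Hanning window, autocorrelation, then Levinson-Durbin recursion."""
    n_w = len(frame)
    x = [f * (0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / n_w))
         for i, f in enumerate(frame)]
    R = [sum(x[i] * x[i + lag] for i in range(n_w - lag)) for lag in range(order + 1)]
    c, err, ks = [1.0], R[0], []
    for m in range(1, order + 1):
        k = -sum(c[i] * R[m - i] for i in range(m)) / err
        ext = c + [0.0]
        c = [ext[i] + k * ext[m - i] for i in range(m + 1)]
        err *= 1.0 - k * k
        ks.append(k)
    return ks

def areas(ks):
    """Areas from lips (index 0) to glottis (last, fixed at 1), using a_{m-1} = a_m (1+mu_m)/(1-mu_m)."""
    result = [1.0]
    for k in reversed(ks):
        result.append(result[-1] * (1 + k) / (1 - k))
    return result[::-1]

signal = synthesize(10 * N_P, N_P, all_pole_denominator(FORMANTS_AE, FS))
results = {}
for phase in (0, 16, 32, 48):              # excitation position within the window
    start = 5 * N_P + phase                # skip the start-up transient
    results[phase] = areas(reflection_coefficients(signal[start:start + N_P], ORDER))

# Largest max/min ratio, over the tube sections, of the areas found at different phases.
spread = max(max(results[p][m] for p in results) / min(results[p][m] for p in results)
             for m in range(ORDER + 1))
```

The waveform is identical from one pitch period to the next, so any value of `spread` above 1 arises purely from the position of the excitation within the analysis window.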

The reason this effect had not been observed in previous VTAF pictures of male speakers is believed to be that the
computation interval used (64 samples, equivalent to 7.81 ms at 8192 Hz sampling rate) is quite close to the pitch
period of the speakers analysed. Thus, fluctuations are not observed as the excitation function remains in a nearly
constant position in the Hanning window. It should be remembered however that the errors may still be present in the
analysis but not show up as fluctuations. For the female speaker $n_p \approx 0.9\, n_c$ and the fluctuations are obvious
(figure 1). This interpretation is supported by the fact that fluctuations did appear in male VTAFs that have been
reprocessed with larger values of $n_c$.
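The drift of the excitation within successive windows can be sketched with exact rational arithmetic. The sketch assumes an idealised, perfectly periodic excitation with pulses at integer multiples of $n_p$ and windows starting at integer multiples of $n_c$; the function name is hypothetical.

```python
from fractions import Fraction

def pulse_offsets(n_p, n_c, frames):
    """Offset, in samples, of the first excitation pulse at or after each window start."""
    return [Fraction(-k * n_c) % n_p for k in range(frames)]

N_C = 64                                                   # computation interval in samples
male = pulse_offsets(Fraction(64), N_C, 10)                # n_p equal to n_c
female = pulse_offsets(Fraction(9, 10) * N_C, N_C, 10)     # n_p = 0.9 n_c

# Number of frames after which the excitation returns to its initial position.
frames_per_cycle = next(k for k in range(1, 10) if female[k] == female[0])
cycle_seconds = frames_per_cycle * N_C / 8192
```

With $n_p = n_c$ the offset never changes, whereas with $n_p = 0.9\, n_c$ it shifts by 6.4 samples per frame and repeats every 9 frames, i.e. about every 70 ms under these idealised assumptions.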


5.  SUPPRESSING  THE  FLUCTUATIONS 

It appears from the above discussion that a partial cure for the problem of fluctuations would be to increase the size
of the Hanning window used to estimate the autocorrelation function. This should improve the estimate of $r_2$. Figure 8
shows the synthetic vowel /ae/ as shown in figure 6 but with $n_w = 3.5\, n_p$. We see that the variations are suppressed.

To test this on real speech the VTAF picture (figure 1) was reprocessed with $n_w = 192$ (approximately $3.5\, n_p$) and the
result is shown in figure 9. The fluctuations have indeed been largely suppressed.


6.  CONCLUDING  REMARKS 

The autocorrelation methods of linear prediction have a certain attraction in terms of robustness and economy of
computing effort. We have shown that care must be taken in choosing lengths for the analysis, but provided this is
done then consistent estimates of the vocal tract area are obtained. If Hanning windows of length $> 2.5\, n_p$ are
used, the resultant area functions appear to have the robustness desirable for automatic speech recognition, or for use
in visual displays for speech training and phonetic studies.